library(ggplot2)
library(dplyr)
library(ggmap)
library(tidyr)
library(cluster)
library(lubridate)
library(rgl)
library(maptools)
library(ggpubr)
library(viridis)
library(tidyverse)
library(scatterplot3d)
library(factoextra)
library(fpc)
library(NbClust)
library(reshape2)
library(scales)
library(plyr)
library(caret)
library(ROCR)
library(pROC)
library(rlist)
library(Hmisc)
library(corrplot)
library(ggcorrplot)
library(DMwR)
library(ggthemes)
library(e1071)
library(maps)
library(RColorBrewer)
library(rgl)

model_dir = "models"
data_dir = "data"
map_dir = "map"
saved_maps = list.files(map_dir)
### Loading data
data=read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")
### Loading map
for(file in saved_maps) {
  load(paste(map_dir,file,sep="/"))
}

I. Introduction

The second assignment for the York Machine Learning course, Machine Learning in Business Context was to explore unsupervised machine learning algorithms, specifically clustering. We chose a dataset from the Toronto Police Sercive Public Safety Data Portal, MCI 2014 to 2017. This report follows the structures laid out in CRISP-DM methodology.

The GitHub repository for all the source code is located at the following link: link here.

The RShiny app is located at the following link: link here.

II. Business Understanding

The Toronto Police Service is the police force that serves the Greater Toronto Area. It is the largest municipal police force in Canada and the third largest police force in Canada. They are a taxpayer-funded service, ranking as second for the government of Toronto’s budgetary expenses. There is always

The objective of this model is to cluster crimes to determine which areas of Toronto have the most levels of crime, overall and for different Major Crime Indicators. This would enable the Toronto Police to most effectively allocate their officers and specialists to the areas that require them the most. The hope is that this would be an effective way to lower crime rates and enable more cases to be solved, all without spending more money.

There are some ethical implications of using crime data. There are many avenues for bias to enter the data set. Police services around North America have come under increased scrutiny in recent years for racist policing. Some police policies or laws inherently disadvantage certain groups of people, which would create bias in the data. This means that conclusions drawn from a machine learning model based on biased data would create biased results. The conclusions should be looked at with other data, such as demographic data, and supplementary information, such as social considerations.

III. Data Understanding

The data set was provided courtesy of the Toronto Police Service Open Data Portal. It is usable in accordance with the Open Government License - Ontario.

The data concerns all Major Crime Indicators (MCI) occurrences in Toronto by reported date, between 2014 and 2017. The MCIs are Assault, Break and Enter, Auto Theft, and Theft Over (excludes Sexual Assaults). Locations in the data set are approximate, as they have been deliberately offset to the nearest road intersection to protect the privacy of involved individuals.

There are 29 columns with 131,073 observations. However, 4 of these columns are duplicates, bringing us to 25 columns. Most of the columns

There are both numeric and categorical attributes and are further divided by date-type data and crime-type data.

The following table shows the numeric attributes:

Attribute Description
… …

While this table shows the categorical attributes:

Attribute Description
… …

IV. Data Exploration and Preparation

(2) Visualizations for the division and neighbourhood attributes

Visualization crime level for all neighbourhoods.

shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)
total_offence_cnt_table = data %>% group_by(Hood_ID) %>% dplyr::summarise(offence_cnt = n())
hood_total_offence_cnt_table = merge(total_offence_cnt_table,sh@data,by.x='Hood_ID',by.y='AREA_S_CD')
points_offense_cnt <- fortify(sh, region = 'AREA_S_CD')
points_offense_cnt <- merge(points_offense_cnt, hood_total_offence_cnt_table, by.x='id', by.y='Hood_ID', all.x=TRUE)
torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
  scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

Visualization Toronto Crimes Heatmap across neighboughhoods

base_size <- 9
    heat_group <- group_by(toronto, Neighbourhood, offence)
    heat_count <- dplyr::summarise(heat_group,Total = n())
    heat_count$Neighbourhood <- with(heat_count,reorder(Neighbourhood,Total))
    heat_count.m <- melt(heat_count)
## Using Neighbourhood, offence as id variables
    heat_count.m <- ddply(heat_count.m, .(variable), transform,rescale = rescale(value))
    ggplot(heat_count.m, aes(Neighbourhood, offence)) + 
        geom_tile(aes(fill = rescale),colour = "white") + 
        ggtitle("Toronto Criminal Heatmap") +
        scale_fill_gradient(low = "lightblue",high = "darkblue") +
        theme_grey(base_size = base_size) + 
        labs(x = "", y = "") + 
        scale_x_discrete(expand = c(0, 0)) +
        scale_y_discrete(expand = c(0, 0)) +
        theme_minimal() +
        theme(legend.position = "none",axis.ticks = element_blank(), axis.text.x = element_text(size = base_size *0.8, angle = 330, hjust = 0, colour = "grey50"))

We also can visualize some interesting information from the dataset e.g.: Finding the top dangerous or top safe zones, neighbourhoods in Toronto based on all MCI or on a specific MCI. The following charts display top 5 dangerous and top 5 safe neighbourhoods in Toronto on total MCI. For more options on other top neighbourhoods/divisions on total MCI or for a specific MCI, please explore on our application.

location_group <- group_by(toronto, Neighbourhood)
crime_by_location <- dplyr::summarise(location_group, n=n())
crime_dangerous <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ]
crime_dangerous_top <- head(crime_dangerous, 5)
ggplot(aes(x = reorder(Neighbourhood, n), y = n, fill = reorder(Neighbourhood, n)), data = crime_dangerous_top) +
          geom_bar(stat = 'identity', width = 0.6) +
          ggtitle("Top 5 dangerous neighbourhoods in Toronto") +
          coord_flip() +
          xlab('Zone') +
          ylab('Number of Occurrences') +
          scale_fill_brewer(palette = "Reds") +
          theme(plot.title = element_text(size = 16),
                axis.title = element_text(size = 12, face = "bold"),
                legend.position="none")

crime_safe <-  crime_by_location[order(crime_by_location$n, decreasing = FALSE), ]
crime_safe_top <- head(crime_safe, 5)
ggplot(aes(x = reorder(Neighbourhood, -n), y = n, fill = reorder(Neighbourhood, -n)), data = crime_safe_top) +
        geom_bar(stat = 'identity', width = 0.6) +
        ggtitle("Top 5 safe neighbourhoods in Toronto") +
        coord_flip() +
        xlab('Zone') +
        ylab('Number of Occurrences') +
        scale_fill_brewer(palette = "Greens") +
        theme(plot.title = element_text(size = 16),
              axis.title = element_text(size = 12, face = "bold"),
              legend.position="none")

### (3) Data outliers It seems like someone reported crimes over 40 years ago. But since nothing stops people from doing this, we are not going to treat these as outliers.

data$occurrencedate <- ymd(gsub("(.*)T.*", "\\1", data$occurrencedate))
data$reporteddate <- ymd(gsub("(.*)T.*", "\\1", data$reporteddate))
data[which(data$occurrencedate < as.POSIXct("1970-01-01")),]
##            ï..X        Y Index_ event_unique_id occurrencedate
## 1257  -79.53082 43.64304    147   GO-2015636294     1966-06-09
## 11889 -79.44592 43.68811   5156  GO-20143212768     1968-01-01
##       reporteddate premisetype ucr_code ucr_ext offence reportedyear
## 1257    2015-04-17   Apartment     1430     100 Assault         2015
## 11889   2014-10-31       House     1430     100 Assault         2014
##       reportedmonth reportedday reporteddayofyear reporteddayofweek
## 1257          April          17               107        Friday    
## 11889       October          31               304        Friday    
##       reportedhour occurrenceyear occurrencemonth occurrenceday
## 1257            15             NA                            NA
## 11889           11             NA                            NA
##       occurrencedayofyear occurrencedayofweek occurrencehour     MCI
## 1257                   NA                                  0 Assault
## 11889                  NA                                  9 Assault
##       Division Hood_ID                   Neighbourhood      Lat      Long
## 1257       D22      14 Islington-City Centre West (14) 43.64304 -79.53082
## 11889      D13     107           Oakwood Village (107) 43.68811 -79.44592
##        FID
## 1257   257
## 11889 5889

The rest of the data is mostly categorical in their nature, so no concerns need to be addressed in terms of data outliers here.

(4) Data Preprocessing

The following code documents the preprocessing steps idendified for this dataset.

#reload the data
data <- read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")

First, we want to remove any duplicate data - rows or columns. Some events have duplicated event IDs and should be removed. We also have duplicate columns for X/Y and Lat/Long, which should be removed. We are don’t use the UCR codes or the ID numbers, so they’re also removed.

#remove duplicate rows/entries
data <- subset(data, !duplicated(data$event_unique_id))

#remove columns that aren't used/duplicates
data <- data[, !colnames(data) %in% c("?..X","Y","Index_","event_unique_id","ucr_code","ucr_ext","FID")]

Next, we format the dates. There are garbage time values at the end of the dates, which are removed. The other date values are also changed into appropriate date/time values. Whitespace is also present in the day of week columns, so that is trimmed.

#formatting dates - remove garbage time values at the end
data$occurrencedate <- gsub("(.*)T.*", "\\1", data$occurrencedate)
data$reporteddate <- gsub("(.*)T.*", "\\1", data$reporteddate)
data$occurrencetime = ymd_h(paste(data$occurrencedate,data$occurrencehour,sep = " "), tz="EST")
data$reportedtime = ymd_h(paste(data$reporteddate,data$reportedhour,sep = " "), tz="EST")
data$occurrencedate = ymd(data$occurrencedate)
data$reporteddate = ymd(data$reporteddate)

#removing whitespace from day of week
data$occurrencedayofweek <- as.factor(trimws(data$occurrencedayofweek, "b"))
data$reporteddayofweek <- as.factor(trimws(data$reporteddayofweek, "b"))

We also convert the month/day of week from string representation to integer representation:

data$reportedmonth = month(data$reportedtime)
data$reporteddayofweek = wday(data$reportedtime)
data$occurrencemonth = month(data$occurrencetime)
data$occurrencedayofweek = wday(data$occurrencetime)

Now let’s take a look at the missing data:

#missing data
colSums(is.na(data))
##                ï..X      occurrencedate        reporteddate 
##                   0                   0                   0 
##         premisetype             offence        reportedyear 
##                   0                   0                   0 
##       reportedmonth         reportedday   reporteddayofyear 
##                   0                   0                   0 
##   reporteddayofweek        reportedhour      occurrenceyear 
##                   0                   0                  32 
##     occurrencemonth       occurrenceday occurrencedayofyear 
##                   0                  32                  32 
## occurrencedayofweek      occurrencehour                 MCI 
##                   0                   0                   0 
##            Division             Hood_ID       Neighbourhood 
##                   0                   0                   0 
##                 Lat                Long      occurrencetime 
##                   0                   0                   0 
##        reportedtime 
##                   0
NAdata <- unique (unlist (lapply (data, function (x) which (is.na (x)))))

Rows with NA data:

NAdata
##  [1]   921   922   923   964   986  1036  1037  1086  1087  1088  1157
## [12]  1158  1159  1211  1228  1508  1509  1510  1689  1690  1691  1692
## [23]  3330  3331 10372 10373 13858 13859 13860 13861 95190 98356

Occurence dates for rows with NA data:

data$occurrencedate[NAdata]
##  [1] "1989-01-01" "1998-01-01" "1991-01-01" "1998-01-01" "1999-10-01"
##  [6] "1998-01-01" "1996-01-01" "1966-06-09" "1980-04-24" "1970-01-01"
## [11] "1999-01-01" "1996-01-01" "1992-12-21" "1996-01-31" "1998-06-01"
## [16] "1974-01-01" "1995-01-01" "1990-01-01" "1988-01-01" "1988-01-01"
## [21] "1973-08-31" "1999-03-24" "1999-01-01" "1982-03-13" "1968-01-01"
## [26] "1995-01-01" "1994-01-01" "1989-01-01" "1987-01-01" "1989-11-20"
## [31] "1996-01-01" "1978-04-10"

We can see that there are 32 missing values, all in occurrence date related columns. Upon further inspection, these are all the same rows. We can also see that the occurrence date value is correct, so these date type columns can have their missing values imputed.

#imputing occurence dates from occurence date field
data$occurrenceyear[NAdata] <- year(data$occurrencedate[NAdata])
data$occurrencemonth[NAdata] <- month(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrenceday[NAdata] <- day(data$occurrencedate[NAdata])
data$occurrencedayofweek[NAdata] <- wday(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrencedayofyear[NAdata] <- yday(data$occurrencedate[NAdata])

Now we replace the space in the strings with an underscore for easier processing:

#replace space in string
data$offence <- gsub("\\s", "_", data$offence)
data$MCI <- gsub("\\s", "_", data$MCI)

Next, all columns are converted into factors except Lat, Long, reportedtime, and occurrencetime. Unused factor levels are also dropped (resulted from missing values).

#change things to factors
for(col in c("offence","MCI","Division","Hood_ID")) {
  data[,col] = as.factor(data[,col])
}

#if we use the gower distance and daisy() function, the following metrics can be considered to converted to ordered factor; but since gower distance turns out to be too expensive for large dataset, we have treated the following as normal factors as well!
for(col in c("reportedyear","reportedmonth","reportedday","reporteddayofyear","reporteddayofweek",
             "reportedhour","occurrenceyear","occurrencemonth","occurrenceday","occurrencedayofyear",
             "occurrencedayofweek","occurrencehour")) {
  data[,col] = factor(data[,col])
}

#drop unused factor levels
for(col in names(data)) {
  if(is.factor(data[,col])) {
    data[,col] = droplevels(data[,col])
  }
}

Finally, we cherck for missing data one last time:

# Check missing values one last time
colSums(is.na(data))
##                ï..X      occurrencedate        reporteddate 
##                   0                   0                   0 
##         premisetype             offence        reportedyear 
##                   0                   0                   0 
##       reportedmonth         reportedday   reporteddayofyear 
##                   0                   0                   0 
##   reporteddayofweek        reportedhour      occurrenceyear 
##                   0                   0                   0 
##     occurrencemonth       occurrenceday occurrencedayofyear 
##                   0                   0                   0 
## occurrencedayofweek      occurrencehour                 MCI 
##                   0                   0                   0 
##            Division             Hood_ID       Neighbourhood 
##                   0                   0                   0 
##                 Lat                Long      occurrencetime 
##                   0                   0                   0 
##        reportedtime 
##                   0

V. Modeling

With the data prepared, we can now start looking at models.

(1) Clustering strategy I.

First, we can look at clustering crime by neighborhood. We need to coerce the data into a clusterable table, sorted by MCI and neighborhood.

# Neighbourhood first
#first, coerce the data into a table that can be clustered - we aren't interested in the occurence date at this point
#courtesy of Susan Li - https://datascienceplus.com/exploring-clustering-and-mapping-torontos-crimes/
bygroup <- group_by(data, MCI, Hood_ID)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Hood_ID", "MCI", "n")]
hood <- as.data.frame(spread(groups, key=MCI, value=n))
hood_id = as.integer(hood[,"Hood_ID"])
hood = hood[,-1]
head(hood)
##   Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1     925       1091             501     243        187
## 2     867        203             132     302         14
## 3     203         59              84      73         10
## 4     233         74              61      57          4
## 5     206         53              56      66          4
## 6     389        175             144      73         19

Then we can use this table to perform k-means clustering. First, we need to normalize the data and determine the number of clusters. To determine the number of clusters, we simply plot the within-cluster sum-of-squares and pick a number after inspecting this plot.

hood <- scale(hood)
#determine number of clusters
wssplot <- function(data, nc=15, seed=1234) {
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
    }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}

#we can see there's an elbow around 3 clusters
wssplot(hood, nc=15)

Now it seems number 3 is a good choice of the number of clusters. We then proceed to perform K-means clustering on the data-set. We have followed this tutorial(https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/) to evaluate the quality of clustering results. Specifically, we have used Silhouette width as a quantitative measures to evaluate the clustering quality. As we will see in the Silhouette plot, both in cluster #2 and cluster #3 have data points where the Silhouette width falls to negative, indicating a not so ideal clustering. This can also be observed in the 1st plot where we see that cluster #2 and cluster #3 is quite close to each other.

# K-means clustering
km.res <- eclust(hood, "kmeans", k = 3, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   10          0.01
## 2       2   37          0.21
## 3       3   93          0.59

Let’s look at the cluster means from the outputs to see what we can learn.

# k-means results
km.res
## K-means clustering with 3 clusters of sizes 10, 37, 93
## 
## Cluster means:
##      Assault Auto_Theft Break_and_Enter    Robbery Theft_Over
## 1  2.5319589  1.6064456       2.5015651  2.3608452  2.9465633
## 2  0.5229260  0.3375853       0.6012997  0.5174595  0.2914680
## 3 -0.4802995 -0.3070442      -0.5082122 -0.4597253 -0.4327952
## 
## Clustering vector:
##   [1] 1 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 2 3 3 2 2 2 1 3 3 3 2 3 3 3 3
##  [36] 3 3 3 2 2 3 3 3 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2
##  [71] 3 3 1 2 1 1 1 1 2 3 3 2 3 3 3 2 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 2 3 2 3 3 3 2 3 2 2 2 2 3 2 3 2 2 2 3 2 2 2 3 3 3 2 1 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 153.40984  68.48510  45.97407
##  (between_SS / total_SS =  61.5 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"

We can see that cluster 3 has lower than average crime. It has 93 neighborhoods, meaning that the majority of Toronto is quite safe in terms of the number of incidents occurring. Cluster 2 has slightly above average crime numbers, and it only has 37 neighborhoods. 10 neighborhoods have much higher than average crime numbers, which is seen in cluster 3. In some way, we can interpret the results as such that crime is concentrated in these small pockets of Toronto.

As an interesting activity, we also looking at a 3D representation of the clustering results. With 3 principal components, more variations should be captured that as in the case in 2 principal components. With a third dimension, we can see that there is more of a spread than initially can be seen simply in two dimensions.

#3D plot
pc <-princomp(hood, cor=TRUE, scores=TRUE)
plot3d(pc$scores[,1:3], col=km.res$cluster, main="k-means clusters")

Next, we can look at the neighborhoods from a hierarchical clustering approach. Again, we need to determine how many clusters we want to have. For hierarchical clustering, we look at the total number of observations that end up in different clusters with different configurations(in this case, the configuration is the number of clusters).

counts <- sapply(2:6, function(ncl) eclust(hood, "hclust", k = ncl, hc_metric = "euclidean", hc_method = "ward.D2")$size)
names(counts) <- 2:6
counts
## $`2`
## [1]  10 130
## 
## $`3`
## [1] 10 44 86
## 
## $`4`
## [1]  1 44 86  9
## 
## $`5`
## [1]  1 44 86  7  2
## 
## $`6`
## [1]  1 15 86  7 29  2

Still, we can see that 3 clusters are not bad if we consider only the cluster size as the criteria for choosing the number of clusters. We don’t want to split the neighborhoods into clusters with a small number of neighborhoods.

Now that we have chosen the number of clusters, we proceed to the clustering process. As previously in k-means, we look at Silhouette width as a quantitative measure to evaluate the clustering quality. To visually inspect the clusters, we plot dendrograms.

# Hierarchical clustering
hc.res <- eclust(hood, "hclust", k = 3, hc_metric = "euclidean", 
                 hc_method = "ward.D2")

# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(hc.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   10          0.04
## 2       2   44          0.16
## 3       3   86          0.61

It seems like we have reached the same conlusion as the k-means algorithms: cluster #1 and cluster #2 does not have large gap. The following codes plot the clustering results on the map. We also plot the offence count in each neiborhood and it can be seen that the 2nd plot is highly correlated with the 1st plot (which makes sense). Map with kMean Clustering v.s Manual Cluster Map

cluster_ids <- km.res$cluster
hood_ids_and_cluster_ids <- data.frame(cbind(hood_id,cluster_ids))
hood_ids_and_cluster_ids$cluster_ids = as.factor(hood_ids_and_cluster_ids$cluster_ids)
shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)

hood_name_and_cluster_ids = merge(hood_ids_and_cluster_ids,sh@data,by.x='hood_id',by.y='AREA_S_CD')
points_clustering <- fortify(sh, region = 'AREA_S_CD')
points_clustering <- merge(points_clustering, hood_name_and_cluster_ids, by.x='id', by.y='hood_id', all.x=TRUE)

p1 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=cluster_ids), data=points_clustering, color='black') +
  scale_fill_brewer(palette = "Set2")
p2 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
  scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))
p3 = ggarrange(p1, p2, ncol = 2, nrow = 1, common.legend = F)
print(p3)

Map with Hierarchical Clustering v.s Manual Cluster Map

cluster_ids <- hc.res$cluster
hood_ids_and_cluster_ids <- data.frame(cbind(hood_id,cluster_ids))
hood_ids_and_cluster_ids$cluster_ids = as.factor(hood_ids_and_cluster_ids$cluster_ids)
shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)

hood_name_and_cluster_ids = merge(hood_ids_and_cluster_ids,sh@data,by.x='hood_id',by.y='AREA_S_CD')
points_clustering <- fortify(sh, region = 'AREA_S_CD')
points_clustering <- merge(points_clustering, hood_name_and_cluster_ids, by.x='id', by.y='hood_id', all.x=TRUE)

p1 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=cluster_ids), data=points_clustering, color='black') +
  scale_fill_brewer(palette = "Set2")
p2 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
  scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))
p3 = ggarrange(p1, p2, ncol = 2, nrow = 1, common.legend = F)
print(p3)

### (2) Clustering strategy II. We can perform clustering on lat/long to look at natural crime hotspots. We can compare this to a manual cluster like the Toronto police divisions.

latlong <- data[, colnames(data) %in% c("Lat", "Long")]

# k-means 
# 34 divisions in Toronto
k.means.fit <- kmeans(latlong, 34)
torclus <- as.data.frame(k.means.fit$centers)
torclus$size <- k.means.fit$size
latlongclus <- latlong
latlongclus$cluster <- as.factor(k.means.fit$cluster)
tormap <- get_map(location =c(left=-79.8129, bottom=43.4544, right=-78.9011, top=43.9132))
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=43.6838,-79.357&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN
#plot each incident, color-coded by cluster
ggmap(tormap) +
  geom_point( data= latlongclus, aes(x=Long[], y=Lat[], color= as.factor(cluster))) +
  theme_void() + coord_map() 
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

#plot bubble map with cluster centroids, size/ color determined by the number of incidents in each cluster
ggmap(tormap) +
  geom_point( data= torclus, aes(x=Long[], y=Lat[], size=size)) +
  theme_void() + coord_map() 
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

We can see how the clusters look with both the incidents plotted as well as the centroids. Next, we can perform k-means clustering on the natural crime hotspot clusters and on the Toronto Police divisions to see how they compare to one another.

#coerce data for Hotspots
data2 <- data
data2$cluster <- k.means.fit$cluster
bygroup <- group_by(data2, MCI, cluster)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("cluster", "MCI", "n")]
hotspot <- data.frame(spread(groups, key=MCI, value=n))
hotspot <- hotspot[, -1]
hotspot = data.frame(scale(hotspot))
#determine number of clusters
#we can see there's an elbow around 4 clusters
wssplot(hotspot, nc=15)

We can see here that there is an elbow around 4, so we can run k means with 4 clusters. With these 4 clusters, we can see that cluster 4 has higher than average auto thefts, and slightly higher robberies, but is below average for the other 3 MCIs. Cluster 1 is a safe part of Toronto with lower crime incidents. Cluster 3 has slightly more assaults, break and enters, and robberies, but lower auto thefts and theft overs. Cluster 2, however, is the most crime-centric area of Toronto, with many more incidents than average, excepting auto thefts. Luckily, there are only 3 hotspots in cluster 2, while there are 12 and 13 in clusters 1 and 3 respectively, which had generally fewer incidents overall.

From the silhouette plot, we can see that clustering based on this data set is better as compared to Clustering strategy I.

km.res <- eclust(hotspot, "kmeans", k = 4, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   16          0.22
## 2       2    4          0.22
## 3       3    4          0.24
## 4       4   10          0.46

km.res
## K-means clustering with 4 clusters of sizes 16, 4, 4, 10
## 
## Cluster means:
##       Assault Auto_Theft Break_and_Enter     Robbery Theft_Over
## 1 -0.03088279 -0.2256176       0.3005858  0.02643693 -0.3488020
## 2  0.55292426  2.4255005      -0.5324398  1.12469735  0.5700703
## 3  1.72830722 -0.6195701       1.6301725  1.22286588  2.0263084
## 4 -0.86308013 -0.3613840      -0.9200304 -0.98132438 -0.4804683
## 
## Clustering vector:
##  [1] 4 1 1 4 1 1 4 3 3 4 4 1 1 1 4 2 2 4 1 1 1 1 1 2 2 3 1 4 3 1 4 1 4 1
## 
## Within cluster sum of squares by cluster:
## [1] 22.932432 10.847959 12.848323  6.212482
##  (between_SS / total_SS =  68.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"
#plotting k-means
clusplot(hotspot, km.res$cluster, main='2D representation of the Cluster solution',
         color=TRUE, shade=TRUE,
         labels=2, lines=0)

#pc <-princomp(hotspot, cor=TRUE, scores=TRUE)
#plot3d(pc$scores[,1:3], col=k.means.fit$cluster, main="k-means clusters")

hotspot$cluster <- factor(km.res$cluster)

ggmap(tormap) +
  geom_point( data= torclus, aes(x=Long[], y=Lat[], color = hotspot$cluster)) +
  theme_void() + coord_map()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

When we plot the clusters using principal component analysis in 2D, we can see that the clusters are relatively well separated. Cluster 1 and 4 are spread out, but clusters 2 and 3 are tighter. On the map, we can also see that cluster 4 is concentrated in downtown Toronto, while cluster 1 is in the northwest part.

#coerce data for Divisions
bygroup <- group_by(data, MCI, Division)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Division", "MCI", "n")]
div <- as.data.frame(spread(groups, key=MCI, value=n))
div_name = div[,1]
div <- div[, -1]

#normalize
for(col in names(div)) {
  div[,col] = (div[,col] - mean(div[,col])) / sd(div[,col])
}

#determine number of clusters
#we can see there's an elbow around 3 clusters
wssplot(div, nc=15)

# k-means
km.res <- eclust(div, "kmeans", k = 3, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1    8          0.46
## 2       2    9          0.21
## 3       3   17          0.96

km.res
## K-means clustering with 3 clusters of sizes 8, 9, 17
## 
## Cluster means:
##      Assault Auto_Theft Break_and_Enter    Robbery Theft_Over
## 1  0.4761876  0.1302555       0.5495307  0.3599759  0.5014688
## 2  1.2770657  1.2633142       1.1990178  1.3373206  1.1973892
## 3 -0.9001819 -0.7301101      -0.8933768 -0.8773937 -0.8698972
## 
## Clustering vector:
##  [1] 1 3 1 3 1 3 2 3 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 1 3 1 3 1 3 1 3
## 
## Within cluster sum of squares by cluster:
## [1]  6.9416751 17.1183040  0.1890451
##  (between_SS / total_SS =  85.3 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"
#similar to neighborhoods, two clusters have low crime incidents (1 and 2), while cluster 3 has higher crime incidents
# most districts are lower crime incident districts, while 9 specifically are higher
#can we map this and see if they correspond to neighborhoods?
#do the Toronto police have a dedicated district to each high crime neighborhood?

Mapping this is probably not possible

division_vs_hood_id = data[,c("Division","Hood_ID")]
table(division_vs_hood_id)
##          Hood_ID
## Division     1    2    3    4    5    6    7    8    9   10   11   12   13
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    1    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0   38   28  348  266  469  218  323
##   D22        0    0    0    0    0    0    5    2   15   15   30    7   15
##   D23     2774 1401  402  395  355  753  653  326    0    0    0    0    0
##   D23      173  117   27   34   30   46   41   34    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    14   15   16   17   18   19   20   21   22   23   24   25   26
##   D11        0    0    0    1    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0  221    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    9    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    1    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22     1747  321  653 1242  569  378  312    0    0    0    0    0    0
##   D22       70    7   37   77   36   23   14    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    4    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    2    0    0    0    0    0
##   D31        0    0    0    0    0    0    0 1029  586  196 1289 1284 1685
##   D31        0    0    0    0    0    0    0   43   34   10   55   44   74
##   D32        0    0    0    0    0    0    0    0    0    0    0    0  206
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    6
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    27   28   29   30   31   32   33   34   35   36   37   38   39
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0  480  298  558    0    0    0    0    0    0    0    0    0
##   D12        0   14   12   15    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0  351  148    0    0    0    0    0    0   33
##   D13        0    0    0    0   11    5    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31     1665    0    0    0    0    0    0    0    0    0    0    0    0
##   D31       68    0    0    0    0    0    0    0    0    0    0    0    0
##   D32      484    0    0    0  682  451  526  609  575  705  358  438  592
##   D32        9    0    0    0   28   15    8   18   12   25   10   11   22
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    1    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0  169
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    3
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    40   41   42   43   44   45   46   47   48   49   50   51   52
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32      328  118    0    0    0    0    0    0    0    5  881 1186   32
##   D32        3    0    0    0    0    0    0    0    0    0   24   52    0
##   D33      335   46  633  194    0  946  290  885  457  296    0    1  531
##   D33        9    3   25    6    0   33    6   26   12   14    0    0   14
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0  125    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    8    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    3  216  652    0    0    0    0    0    0    0    0
##   D54        0    0    0    9   33    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    53   54   55   56   57   58   59   60   61   62   63   64   65
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33      315    0    0    0    0    0    0    0    0    0    0    0    0
##   D33       20    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    6    0    0    0    0    0    0    0    0    1    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0  496  330    0    0    0    0    0    0    0    0    0
##   D53        0    0   23   14    0    0    0    0    0    0    0    0    0
##   D54        0  661    0    0  209  373  297  294  575  297    0    0    0
##   D54        0   27    0    0    8   23   19   10   24   18    0    0    0
##   D55        0    0    0    0    0    0    0    0    1  847  559  421  580
##   D55        0    0    0    0    0    0    0    0    0   31   23   23   17
##          Hood_ID
## Division    66   67   68   69   70   71   72   73   74   75   76   77   78
##   D11        0    0    0    0    0    0    0    0    0    0    0    1    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    1    0    2    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0 1153 1050
##   D14        0    0    0    0    0    0    0    0    0    0    0   45   36
##   D22        0    0    0    1    0    0    0    0    0    0    0    1    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0  124  752  635 2213  981 2713    0  539    0
##   D51        0    0    0    0   12   23   27   92   49  110    0   19    0
##   D52        0    0    0    0    0    0    0    0    0 1099 2246 1840  929
##   D52        0    0    0    0    0    0    0    0    0   57  106   86   40
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54      576  382    0    0    0    0    0    0    0    0    0    0    0
##   D54       22   12    0    0    0    0    0    0    0    0    0    0    0
##   D55      196  101  358  295 1122    0    0    0    0    0    0    0    0
##   D55        5    4    7    9   42    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    79   80   81   82   83   84   85   86   87   88   89   90   91
##   D11        0    0    0    0  249  122   41  710  445  697  368  521  284
##   D11        0    0    0    0    9    6    3   35   14   18   13   26   13
##   D12        0    0    0    0    0    0    0    0    0    0    0  132  177
##   D12        0    0    0    0    0    0    0    0    0    0    0    2    3
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14      546  552  819  908  128  336  803  211    0    0    0    0    0
##   D14       18   12   29   26    2   18   32    4    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    1    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    2    0    0    0    0    0    1    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52      292    0    0    0    0    0    0    0    0    0    0    0    0
##   D52       17    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division    92   93   94   95   96   97   98   99  100  101  102  103  104
##   D11        0  803    0    0    0    0    0    0    0    0    0    0    0
##   D11        0   32    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13      498   94  414    0  238    0    0    0    0  172  233    0    0
##   D13       31    2   13    0   11    0    0    0    0    7   12    0    0
##   D14        0  633    0  858    0    0    0    0    0    0    0    0    0
##   D14        0   28    0   30    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0  207    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    9    0    0    0    0    0    0
##   D52        0    0    0  137    0    0    3    0    0    0    0    0    0
##   D52        0    0    0    5    0    0    0    0    0    0    0    0    0
##   D53        0    0    0  899   71  169  814  263  222   80  133  338  754
##   D53        0    0    0   38    2    2   33   11    5    3    2    9   30
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division   105  106  107  108  109  110  111  112  113  114  115  116  117
##   D11        0    0    0    0    0    0   36    0    0  122    0    0    0
##   D11        0    0    0    0    0    0    1    0    0    3    0    0    0
##   D12        0    0    0    1    0  344  781  406  959    0  578    0    0
##   D12        0    0    0    0    0    7   32   12   43    0   35    0    0
##   D13        0  263  528  721  210    1    0    0    0    0    0    0    0
##   D13        0   15   26   30   12    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    1    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32      314    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        8    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0    0   29   53
##   D33        0    0    0    0    0    0    0    0    0    0    0    3    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D42        0    0    0    0    0    0    0    0    0    0    0  458  981
##   D42        0    0    0    0    0    0    0    0    0    0    0   15   27
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D43        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    1    0    0    1    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53       20    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0    0    0    0
##          Hood_ID
## Division   118  119  120  121  122  123  124  125  126  127  128  129  130
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D33        2  164    0    0    0    0    0    0    0    0    0    0    0
##   D33        1    6    0    0    0    0    0    0    0    0    0    0    0
##   D41        0 1387 1311  830  921  238 1040  555 1159  640    0    0    0
##   D41        0   33   83   35   45    7   37   28   32   23    0    0    0
##   D42      658    0    0    0    0    0    0    0    0    0  886  659  962
##   D42       33    0    0    0    0    0    0    0    0    0   33   17   21
##   D43        0    1    0    0    0  402    0    0    0  805    0    0    0
##   D43        0    1    0    0    0   33    0    0    0   59    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0    0    0    0
##   D54        0    1  107   18    0    0    1    0    0    0    0    0    0
##   D54        0    0   10    0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0   82   16    0    0    0    0    0    0    0    0
##   D55        0    0    0    3    1    0    0    0    0    0    0    0    0
##          Hood_ID
## Division   131  132  133  134  135  136  137  138  139  140
##   D11        0    0    0    0    0    0    0    0    0    0
##   D11        0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0
##   D12        0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0
##   D13        0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0
##   D14        0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0
##   D22        0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0
##   D23        0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0
##   D31        0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0
##   D32        0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0
##   D33        0    0    0    0    0    0    0    0    0    0
##   D41        0    0    0    0    0    0    0  543    0    0
##   D41        0    0    0    0    0    0    0   16    0    0
##   D42     1033 1283    1    0    0    0    0    0    0    0
##   D42       41   58    0    0    0    0    0    0    0    0
##   D43      177    0  200  549  561 1741 2011  627  795  266
##   D43        9    0   11   36   44  124  125   41   41   12
##   D51        0    0    0    0    0    0    0    0    0    0
##   D51        0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0
##   D52        0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0
##   D53        0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0
##   D54        0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0
##   D55        0    0    0    0    0    0    0    0    0    0

(3) Clustering strategy III.

VI. Evaluation

VII. Deployment

VIII. Conclusions